Mr. Karan Mitra Mr. Prabhakaran Mr. Devanandh
Ms. Sulekha Aloorravi
A house's value is more than location and square footage. Like the features that make up a person, an informed party wants to know every aspect that gives a house its value. For example, suppose you want to sell a house but don't know what price to ask — it can't be too low or too high. To estimate the price, you would typically look at similar properties in your neighbourhood and use that data to assess your own house's value.
Take advantage of all of the feature variables described below; use them to analyse and predict house prices.
total_area: measure of both living and lot area
price: the prediction target
# Basic library imports
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly
from plotly import __version__
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from sklearn.ensemble import RandomForestRegressor
print(__version__)  # requires plotly version >= 1.9.0
init_notebook_mode(connected=False)
import warnings
warnings.filterwarnings("ignore")
# Importing dataset
house_df = pd.read_csv('innercity.csv')
# Viewing first 10 entries in the dataframe
house_df.head(10)
# analyzing the size of the dataframe and the variable datatypes
house_df.info()
df_size=house_df.shape
There are a total of 21613 data points.
The input variables are all of either integer or float datatype.
The target variable (price) is of integer datatype.
Hence, we will evaluate regression-based models for this numeric target variable.
The "dayhours" column is of object type. Let's eyeball the data.
# Eyeballing dayhours column
house_df['dayhours'].head(5)
The first 4 digits encode the year, the next 2 digits the month, the next 2 digits the date, and the last 7 characters are likely a timestamp. Hence, we split this variable into its respective components.
# copying the source dataframe onto a new dataframe for manipulation
house_df_new=house_df.copy()
# creating a new column to mimic the timeframe
house_df_new['sold_date_full']=house_df_new['dayhours'].str[:8].astype('int64')
# Sold date versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='sold_date_full',y_vars='price')
There are clusters forming in the time series. Hence, let's split it into individual features: date, month and year.
# Creating separate features for sold date,month and year
house_df_new['sold_year']=house_df_new['dayhours'].str[:4].astype('int64')
house_df_new['sold_month']=house_df_new['dayhours'].str[4:6].astype('int64')
house_df_new['sold_date']=house_df_new['dayhours'].str[6:8].astype('int64')
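The same split can also be obtained via pandas datetime parsing — a sketch assuming the dayhours strings follow the `YYYYMMDDTHHMMSS` pattern seen above (the toy frame below is illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame mimicking the dayhours format observed above
demo = pd.DataFrame({'dayhours': ['20141013T000000', '20150225T000000']})
sold = pd.to_datetime(demo['dayhours'], format='%Y%m%dT%H%M%S')
demo['sold_year'] = sold.dt.year
demo['sold_month'] = sold.dt.month
demo['sold_date'] = sold.dt.day
```

Parsing to a real datetime also validates the digits (an invalid month would raise), which plain string slicing does not.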
# Evaluating feature - sold_year
house_df_new['sold_year'].head(5)
# evaluating feature - sold_month
house_df_new['sold_month'].head(5)
# evaluating feature- sold_date
house_df_new['sold_date'].head(5)
# having split the dayhours data, dropping the dayhours column from the dataframe
house_df_new=house_df_new.drop('dayhours',axis=1)
# looking into the new dataframe after dropping dayhours feature
house_df_new.info()
# Check for NA values and count for each feature
house_df_new.isna().sum(axis=0)
There are no missing values in the dataset.
Univariate analysis – data types and description of the independent attributes, including name, meaning, range of values observed, central values (mean and median), standard deviation, quartiles, analysis of the body and tails of the distributions, missing values and outliers.
# bird's eye view of the numerical distribution of the dataframe
house_df_new.describe().transpose()
To get more clarity, let's evaluate each variable separately.
Check each input variable's distribution and outliers. In case outliers are present, evaluate their importance through correlation analysis with the target price.
# Verifying the distribution and histogram of the variable
sns.distplot(house_df_new['cid'])
house_df_new.head(3)
There are two distinct distribution peaks in this column. Though this variable is only an identifier, the histogram shows there are repeat/duplicate entries.
# creation of a copy dataframe to evaluate the duplicates
house_df_f=house_df_new.copy()
# duplicate entry extraction
house_df_f['cid_id']=house_df_new['cid'].duplicated()
# size of the duplicate dataframe
house_df_f.loc[house_df_f['cid_id'] == True].shape
There are a total of 177 duplicate entries in the dataframe. Let's look into the nature of the duplicates and check whether they are true duplicates.
# Listing the top 5 duplicate entries
house_df_f.loc[house_df_f['cid_id'] == True].sort_values(by='cid').head(5)
# returning all instances of the first 5 duplicate entries
house_df_f.loc[house_df_f['cid'].isin(['1000102','7200179','109200390','123039336','251300110'])].sort_values(by='cid')
There are repeated entries with no change in the other parameters, indicating the same property was bought and resold. Hence, we retain them in the dataframe.
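The resale interpretation can be checked with a groupby count — a minimal sketch on a hypothetical mini-frame (column names follow the dataset):

```python
import pandas as pd

# Hypothetical mini-frame: a repeated cid means the property was resold
demo = pd.DataFrame({'cid': [1, 1, 2, 3],
                     'price': [300000, 320000, 450000, 500000]})
sale_counts = demo.groupby('cid')['price'].count()
resold = sale_counts[sale_counts > 1]  # properties sold more than once
```

A resold property keeps the same cid but may carry a different price, which is consistent with retaining these rows.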
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['cid'],name='House property identifier',showlegend=True))]
plotly.offline.iplot(data,image='jpeg')
sns.boxplot(house_df_new['cid'])
plt.show()
# corelation of cid vs price
house_df_new['cid'].corr(house_df_new['price'])
No outliers are present in this variable.
Note: This is an identification variable only, as also implied by its lack of correlation with price; hence, it can be dropped for regression analysis.
#evaluating the unique entries in the variable list
house_df_new['room_bed'].sort_values().unique()
# Verifying the distribution of the variable
sns.distplot(house_df_new['room_bed'])
plt.show()
The data is right skewed, indicating outliers; the distinct peaks also indicate clusters in the dataframe.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['room_bed'],name='room_bed',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['room_bed'])
plt.show()
The number of bedrooms varies from 0 to 33.
As per the box plot, outliers are present below 2 and above 5. Outlier treatment is required here.
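The whisker limits quoted here and in the later boxplots follow the standard 1.5×IQR rule; a small helper to compute them directly (a sketch, not part of the original analysis):

```python
import pandas as pd

def iqr_fences(s):
    """Return the lower/upper whisker limits (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# toy example: the value 100 falls well above the upper fence
lo, hi = iqr_fences(pd.Series([1, 2, 3, 4, 100]))
```

Applying this to each numeric column reproduces the fence values the boxplots report.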
# Evaluating the outliers in number of bedrooms - Case A
# case A
room_bed_outlier=house_df_new[(house_df_new.room_bed>5)|(house_df_new.room_bed<2)]
room_bed_outlier.shape
# Case A corelation with price
room_bed_outlier['room_bed'].corr(house_df_new['price'])
There are 546 entries with number of bedrooms >5 or <2, with a very low correlation with price. Let's check the number of entries with bedrooms >5 or <1.
# Case B
room_bed_outlier_1=house_df_new[(house_df_new.room_bed>5)|(house_df_new.room_bed<1)]
room_bed_outlier_1.shape
# Case B corelation with price
room_bed_outlier_1['room_bed'].corr(house_df_new['price'])
There are 347 entries with number of bedrooms >5 or <1. They have no correlation with price, so we can remove them from the analysis dataframe. If further improvement in model accuracy is required, Case A can be removed as well.
#evaluating the unique entries in the variable list
house_df_new['room_bath'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['room_bath'].sort_values().unique().shape
The number of bathrooms (room_bath) varies from 0 to 8 with a total of 30 unique entries, including decimal values.
# Verifying the distribution of the variable
sns.distplot(house_df_new['room_bath'])
plt.show()
The data is right skewed, indicating outliers; the distinct peaks also indicate clusters in the dataframe.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['room_bath'],name='room_bath',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['room_bath'])
plt.show()
The outliers are present below 0.75 and above 3.5 as per the box plot. Outlier treatment is required here.
# creating outlier dataframe
room_bath_outlier=house_df_new[(house_df_new.room_bath>3.5)|(house_df_new.room_bath<0.75)]
room_bath_outlier.shape
# correlation of no. of bathrooms with price
room_bath_outlier['room_bath'].corr(house_df_new['price'])
There are 571 outlier entries, and they have only a moderate positive correlation with price. Hence, based on modelling accuracy, we can decide whether to retain or remove these outliers.
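"Decide based on modelling accuracy" can be made concrete by comparing cross-validated scores with and without the outlier rows — a sketch on synthetic data (the model and metric here are illustrative choices, not the notebook's final model):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def cv_r2(X, y):
    """Cross-validated R-squared of a small random forest."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring='r2').mean()

# Toy data standing in for the house features and price
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, 2.0, 0.5, -1.0]) + rng.normal(scale=0.1, size=200)
score = cv_r2(X, y)
```

Running `cv_r2` once on the full data and once on the outlier-filtered data, then comparing the two scores, is one way to make the retain-or-remove call objectively.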
#evaluating the unique entries in the variable list
house_df_new['living_measure'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['living_measure'].sort_values().unique().shape
There are 1038 unique entries of living measure
# Verifying the distribution of the variable
sns.distplot(house_df_new['living_measure'])
plt.show()
The distribution is unimodal but right skewed, indicating outliers.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['living_measure'],name='living_measure',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['living_measure'])
plt.show()
The outliers are present above 4230 as per the box plot.
# creating outlier dataframe
living_measure_outlier=house_df_new[(house_df_new.living_measure>4230)]
living_measure_outlier.shape
# corelation of living measure with price
living_measure_outlier['living_measure'].corr(house_df_new['price'])
There are 572 outlier entries with a moderate positive correlation with price. Hence, the decision to remove or retain them will be made based on model accuracy.
#evaluating the unique entries in the variable list
house_df_new['lot_measure'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['lot_measure'].sort_values().unique().shape
There are 9782 unique entries of lot measure
# Verifying the distribution of the variable
sns.distplot(house_df_new['lot_measure'])
plt.show()
The distribution is right skewed, indicating outliers. The histogram shows an abnormally high number of instances at smaller lot measures, indicating that such properties dominate the dataset.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['lot_measure'],name='lot_measure',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['lot_measure'])
plt.show()
The outliers are present above 19141 as per the box plot.
# creating outlier dataframe
lot_measure_outlier=house_df_new[(house_df_new.lot_measure>19141)]
lot_measure_outlier.shape
# corelation of lot measure with price
lot_measure_outlier['lot_measure'].corr(house_df_new['price'])
There are 2425 outlier entries. They have no correlation with price, so they can be removed from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['ceil'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['ceil'].sort_values().unique().shape
There are 6 unique entries for the total number of floors, including decimal values.
# Verifying the distribution of the variable
sns.distplot(house_df_new['ceil'])
plt.show()
The data has 4 peaks, indicating 4 clusters. The most frequent value in the histogram is 1 floor, followed by 2 floors. The data is right skewed, indicating outliers at higher numbers of floors.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['ceil'],name='ceil',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['ceil'])
plt.show()
Contrary to the inference from the distribution plot, the box plot shows no outliers in the number of floors. Hence, no outlier treatment is required.
#evaluating the unique entries in the variable list
house_df_new['coast'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['coast'].sort_values().unique().shape
This is a categorical variable indicating whether the property faces a waterbody.
# Verifying the distribution of the variable
sns.distplot(house_df_new['coast'])
plt.show()
The data is extremely right skewed. The histogram shows an abnormally high number of properties that do not face a waterbody.
# Number of houses not facing waterbody
coast_no=house_df_new[house_df_new.coast==0].shape
coast_no[0]
# Number of houses facing waterbody
coast_yes=house_df_new[house_df_new.coast==1].shape
coast_yes[0]
# Percentage of houses facing waterbody
print ('%3.2f'%(coast_yes[0]/df_size[0]*100),'%')
The data shows only 163 houses face a waterbody, while the rest do not. That is, only 0.75% of the total houses face a waterbody.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['coast'],name='Coast (waterbody facing)',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['coast'])
plt.show()
As houses facing a waterbody are only 0.75% of the total population, they are flagged as outliers. Hence, let's evaluate the impact of facing a waterbody on price.
# Waterbody facing status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='coast',y_vars='price')
# Waterbody facing status versus price - Correlation analysis
house_df_new['coast'].corr(house_df_new['price'])
The variable coast has a low correlation with the target price. Also, as this sample's population is small, we can remove these rows from the analysis.
# creating outlier dataframe
coast_outlier=house_df_new[(house_df_new.coast==1)]
coast_outlier.shape
#evaluating the unique entries in the variable list
house_df_new['sight'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['sight'].sort_values().unique().shape
This is a categorical variable indicating how many times the property has been viewed: 0 indicates the property has not been viewed earlier, while the maximum number of views is 4.
# Verifying the distribution of the variable
sns.distplot(house_df_new['sight'])
plt.show()
The data is extremely right skewed. The histogram shows an abnormally high number of properties without previous viewings.
# Number of houses not viewed previously : Case C0
sight_no=house_df_new[house_df_new.sight==0].shape
sight_no[0]
# Number of houses previously viewed once : Case C1
sight_once=house_df_new[house_df_new.sight==1].shape
sight_once[0]
# Number of houses previously viewed twice : Case C2
sight_twice=house_df_new[house_df_new.sight==2].shape
sight_twice[0]
# Number of houses previously viewed thrice : Case C3
sight_thrice=house_df_new[house_df_new.sight==3].shape
sight_thrice[0]
# Number of houses previously viewed four times : Case C4
sight_four=house_df_new[house_df_new.sight==4].shape
sight_four[0]
# No. of houses viewed previously
print (sight_once[0]+sight_twice[0]+sight_thrice[0]+sight_four[0])
# Percentage of houses viewed previously
print ('%3.2f'%((sight_once[0]+sight_twice[0]+sight_thrice[0]+sight_four[0])/df_size[0]*100),'%')
The data shows 2164 houses were viewed previously, accounting for 9.83% of the total house population. This is a significant number, so the impact on pricing needs to be evaluated before deciding on outlier treatment.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['sight'],name='sight',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['sight'])
plt.show()
As houses viewed previously are only 9.83% of the total population, they are flagged as outliers. Hence, let's evaluate the impact of previous viewings on price.
# House previously viewed status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='sight',y_vars='price')
# House previously viewed status versus price - Correlation analysis
house_df_new['sight'].corr(house_df_new['price'])
The variable sight has a low correlation with the target price. However, since these entries are sizeable in number, the call to remove or retain them can be made based on modelling accuracy.
# creating outlier dataframe
sight_outlier=house_df_new[(house_df_new.sight>0)]
sight_outlier.shape
#evaluating the unique entries in the variable list
house_df_new['condition'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['condition'].sort_values().unique().shape
This is a categorical variable indicating the overall condition of the property, ranging from 1 to 5; likely 1 indicates poor condition and 5 very good condition.
# Verifying the distribution of the variable
sns.distplot(house_df_new['condition'])
plt.show()
The data has three peaks. With the rating of 3 being the most frequent value, the distribution is centrally spread.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['condition'],name='condition',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['condition'])
plt.show()
The boxplot shows that the condition 1 is an outlier. Let's quantify the number of instances of condition 1.
# creating outlier dataframe
condition_outlier=house_df_new[(house_df_new.condition==1)]
condition_outlier.shape
There are only 30 outlier entries. Let us remove them from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['quality'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['quality'].sort_values().unique().shape
This variable is of categorical type indicating the grade of the property with value ranging from 1 to 13.
# Verifying the distribution of the variable
sns.distplot(house_df_new['quality'])
plt.show()
The data has 7 peaks. With the rating of 7 being the most frequent value, the distribution is centrally spread.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['quality'],name='quality',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['quality'])
plt.show()
The boxplot shows that quality <6 and >9 are outliers. Let's quantify the number of outlier instances.
# creating outlier dataframe
quality_outlier=house_df_new[(house_df_new.quality<6)|(house_df_new.quality>9)]
quality_outlier.shape
There are 1911 outlier entries. Let us look into their correlation with price.
# correlation of quality with price
quality_outlier['quality'].corr(house_df_new['price'])
The outlier entries have a moderate positive correlation with price. Hence, the decision to remove or retain them can be made based on model accuracy.
# evaluating the number of unique entries
house_df_new['ceil_measure'].sort_values().unique().shape
There are 946 unique entries in the variable
# Verifying the distribution of the variable
sns.distplot(house_df_new['ceil_measure'])
plt.show()
The data is roughly normally distributed with moderate right skewness, indicating outliers at high ceil measures.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['ceil_measure'],name='ceil_measure',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['ceil_measure'])
plt.show()
The boxplot shows that ceil measure > 3740 are outliers. Let's quantify the number of instances.
# creating outlier dataframe
ceil_measure_outlier=house_df_new[(house_df_new.ceil_measure>3740)]
ceil_measure_outlier.shape
# corelation of ceil_measure with price
ceil_measure_outlier['ceil_measure'].corr(house_df_new['price'])
There are 611 outlier entries with a moderate positive correlation with price. Hence, based on model accuracy we can choose to remove or retain them.
#evaluating the unique entries in the variable list
house_df_new['basement'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['basement'].sort_values().unique().shape
There are 306 unique entries in the variable
# Verifying the distribution of the variable
sns.distplot(house_df_new['basement'])
plt.show()
The data has two peaks and right skewness, indicating outliers at high basement measures. The histogram also indicates that most houses do not have a basement.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['basement'],name='basement',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['basement'])
plt.show()
The boxplot shows that basement measures > 1400 are outliers. Let's quantify the number of instances.
# creating outlier dataframe
basement_outlier=house_df_new[(house_df_new.basement>1400)]
basement_outlier.shape
# corelation of basement size with price
basement_outlier['basement'].corr(house_df_new['price'])
There are 496 outlier entries with a moderate positive correlation with price. Hence, the call to remove or retain them will be based on modelling accuracy.
#evaluating the unique entries in the variable list
house_df_new['yr_built'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['yr_built'].sort_values().unique().shape
There are 116 unique entries in the variable. The earliest house was built in 1900, the latest in 2015.
# Verifying the distribution of the variable
sns.distplot(house_df_new['yr_built'])
plt.show()
The data shows that house construction followed an increasing trend from 1900, peaking in the 2000s.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['yr_built'],name='yr_built',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['yr_built'])
plt.show()
The boxplot shows there are no outliers in the data.
#evaluating the unique entries in the variable list
house_df_new['yr_renovated'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['yr_renovated'].sort_values().unique().shape
There are 70 unique entries in the variable. '0' typically indicates that the property was not renovated.
# Verifying the distribution of the variable
sns.distplot(house_df_new['yr_renovated'])
plt.show()
The data yields two inferences: most houses have not been renovated, and the renovations that did occur were mostly in the 2000s.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['yr_renovated'],name='yr_renovated',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['yr_renovated'])
plt.show()
The boxplot flags all renovated houses as outliers. Let's look into the number of such instances.
# creating outlier dataframe
yr_renovated_outlier=house_df_new[(house_df_new.yr_renovated>0)]
yr_renovated_outlier.shape
There are 914 outlier entries. Let us look into their correlation with price to decide whether to remove them from the analysis dataframe.
# corelation of yr_renovated with price
yr_renovated_outlier['yr_renovated'].corr(house_df_new['price'])
The correlation with price is very low. Hence, we can remove them from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['zipcode'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['zipcode'].sort_values().unique().shape
There are 70 unique zipcode entries in the dataset. On looking up the zipcodes, they are located in the Seattle, Washington area in the USA.
# Verifying the distribution of the variable
sns.distplot(house_df_new['zipcode'])
plt.show()
There are multiple peaks indicating clusters present in the data. Almost all zipcodes have multiple entries, indicating multiple house properties in a given area.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['zipcode'],name='zipcode',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['zipcode'])
plt.show()
There are no outliers present in the dataset.
#evaluating the unique entries in the variable list
house_df_new['lat'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['lat'].sort_values().unique().shape
There are 5034 unique latitude entries in the dataset.
# Verifying the distribution of the variable
sns.distplot(house_df_new['lat'])
plt.show()
There are three distinct peaks indicating 3 prominent latitude clusters present in the data.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['lat'],name='lat',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['lat'])
plt.show()
As per the boxplot, latitudes < 47.1622 are outliers; thus there are two outliers in the dataset. Let us look at them in conjunction with the longitude details to check if they are truly outliers.
# printing all rows with suspected outliers in Latitude and checking against longitude and zipcode
house_df_new[house_df_new.lat<47.1622][['lat','long','zipcode']]
The suspected latitude coordinates were verified along with the respective longitudes and were matching to the zipcode provided. Hence, they are not cases of mis-entry and thus these data points would be retained in the dataset.
https://www.melissa.com/v2/lookups/latlngzip4/index?lat=47.1559&lng=-121.646
https://www.melissa.com/v2/lookups/latlngzip4/index?lat=47.1593&lng=-121.957
#evaluating the unique entries in the variable list
house_df_new['long'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['long'].sort_values().unique().shape
There are 752 unique longitude entries in the dataset.
# Verifying the distribution of the variable
sns.distplot(house_df_new['long'])
plt.show()
There are five distinct peaks indicating 5 prominent longitude clusters present in the data.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['long'],name='long',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['long'])
plt.show()
As per the boxplot, all values greater than -121.821 are marked as outliers. Let's evaluate them along with latitude to check if they are truly outliers.
# extracting all rows with suspected outliers in longitude
long_outlier=house_df_new[house_df_new.long>-121.821]
long_outlier.shape
There are a total of 256 suspected outliers in longitude
# verifying unique zipcodes of the suspected longitude outliers
long_outlier['zipcode'].unique()
ID: 21514 Lat: 47.4834 Long: -121.773 Zipcode: 98045
#!pip install uszipcode #to install the uszipcode package
# using the SearchEngine module in the uszipcode package
from uszipcode import SearchEngine
search = SearchEngine(simple_zipcode=True)  # load only the simple_zipcode database (~9 MB)
# Employing reverse geocoding to evaluate the zipcode for the input lat and long
res = search.by_coordinates(47.4834, -121.773, radius=10, returns=0)
for i in range(len(res)):
    print(res[i].zipcode)
# corelation of longitude with price
long_outlier['long'].corr(house_df_new['price'])
For this lat/long pair, the zipcode in the US database does not match the entry in our house database. Also, there is no correlation with price; hence, we'll remove these rows from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['living_measure15'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['living_measure15'].sort_values().unique().shape
There are 777 unique entries of living measure (2015) in the dataset.
# Verifying the distribution of the variable
sns.distplot(house_df_new['living_measure15'])
plt.show()
The data has a normal distribution with only one peak. The distribution is slightly right skewed, indicating a possibility of outliers.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['living_measure15'],name='Living measure renovated in 2015',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['living_measure15'])
plt.show()
As per the boxplot, values > 3660 are suspected outliers.
# creating outlier dataframe
liv_meas15_outlier=house_df_new[(house_df_new.living_measure15>3660)]
liv_meas15_outlier.shape
# correlation of living measure (2015) with price
liv_meas15_outlier['living_measure15'].corr(house_df_new['price'])
There are 544 outlier entries. They have no correlation with price, so they can be removed from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['lot_measure15'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['lot_measure15'].sort_values().unique().shape
There are 8689 unique entries of lot measure (2015) in the dataset.
# Verifying the distribution of the variable
sns.distplot(house_df_new['lot_measure15'])
plt.show()
The data has two peaks, indicating two clusters. The histogram shows an abnormally high number of instances at smaller lot measures, indicating that such properties dominate. The data is also extremely right skewed, indicating outliers.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['lot_measure15'],name='Lot measure in 2015',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['lot_measure15'])
plt.show()
As per the boxplot, values > 17550 are suspected outliers.
# creating outlier dataframe
lot_meas15_outlier=house_df_new[(house_df_new.lot_measure15>17550)]
lot_meas15_outlier.shape
# correlation of lot measure(2015) with price
lot_meas15_outlier['lot_measure15'].corr(house_df_new['price'])
There are 2194 outlier entries with no correlation with price. Hence, they can be removed from the analysis dataframe.
#evaluating the unique entries in the variable list
house_df_new['furnished'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['furnished'].sort_values().unique().shape
This is a categorical variable indicating whether the house is furnished.
# Verifying the distribution of the variable
sns.distplot(house_df_new['furnished'])
plt.show()
As the data is categorical, the distribution cannot be quantified, but the histogram shows that only a few houses are furnished while the rest are unfurnished.
# Number of unfurnished houses
furn_no=house_df_new[house_df_new.furnished==0].shape
furn_no[0]
# Number of furnished houses
furn_yes=house_df_new[house_df_new.furnished==1].shape
furn_yes[0]
# Percentage of Furnished houses
print ('%3.2f'%(furn_yes[0]/df_size[0]*100),'%')
The data shows only 4251 houses are furnished while the rest are unfurnished; that is, only 19.67% of the houses are furnished.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['furnished'],name='Furnished',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['furnished'])
plt.show()
As furnished houses are only 19.67% of the total population, they are flagged as outliers. Hence, let's evaluate the impact of furnishing on price.
# Furnished status versus price - Pairplot visualization
sns.pairplot(house_df_new,x_vars='furnished',y_vars='price')
# Furnished status versus price - Correlation analysis
house_df_new['furnished'].corr(house_df_new['price'])
The variable furnished has a moderate positive correlation with the target price, and the pairplot shows that furnished houses command higher prices. Hence, we retain all these rows.
#evaluating the unique entries in the variable list
house_df_new['total_area'].sort_values().unique()
# evaluating the number of unique entries
house_df_new['total_area'].sort_values().unique().shape
There are 11163 unique total area entries in the dataset.
# Verifying the distribution of the variable
sns.distplot(house_df_new['total_area'])
plt.show()
The data has a single distinguishable peak. The histogram shows an abnormally high number of instances at smaller total areas, indicating that such properties dominate. The data is also extremely right skewed, indicating outliers.
# Plotting boxplot to detect outliers
data=[(go.Box(x=house_df_new['total_area'],name='total_area',showlegend=True))]
plotly.offline.iplot(data)
sns.boxplot(house_df_new['total_area'])
plt.show()
As per the boxplot, values > 21942 are suspected outliers.
# creating outlier dataframe
total_area_outlier=house_df_new[(house_df_new.total_area>21942)]
total_area_outlier.shape
# correlation of total area with price
total_area_outlier['total_area'].corr(house_df_new['price'])
There are 2419 outlier entries. They have no correlation with price, so they can be removed from the analysis dataframe.
Based on the univariate and bivariate analysis, the following outlier removals were decided:
1) total_area (2419 entries)
2) lot_measure15 (2194 entries)
3) room_bed -> Case B (347 entries)
4) lot_measure (2425 entries)
5) long (256 entries)
6) coast (163 entries)
7) condition (30 entries)
8) yr_renovated (914 entries)
9) cid (entire column)
# With reference to the univariate inferences, converting binomial variables (coast, furnished) and date variables (sold_year, sold_month, yr_renovated) into categorical variables via one-hot encoding
house_df_new=pd.get_dummies(house_df_new, columns= ['sold_year','sold_month','yr_renovated','coast','furnished'])
house_df_new.info()
house_df_new.head(5)
X=house_df_new.drop('price',axis=1)
X_corr=X.corr()
fig, ax = plt.subplots(figsize=(20,20))
sns.heatmap(X_corr,annot=True)
def get_redundant_pairs(X):
    '''Get diagonal and lower-triangular pairs of the correlation matrix'''
    pairs_to_drop = set()
    cols = X.columns
    for i in range(0, X.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop
def get_top_abs_correlations(X, n=25):
    au_corr = X.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(X)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]
print("Top Absolute Correlations")
print(get_top_abs_correlations(X, 25))
lot_measure and total_area are very strongly correlated. Likewise, ceil_measure and living_measure are highly correlated. As there are strong positive correlations between some of the input variables, we could apply PCA to reduce the dimensionality.
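PCA is suggested here but not carried out later in the notebook; as a hedged sketch, dimensionality reduction on a scaled feature matrix could look like the following (the data is synthetic, with one strongly correlated pair standing in for pairs such as lot_measure/total_area):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic features: columns 0 and 1 are almost collinear, 2 and 3 are noise
rng = np.random.default_rng(42)
base = rng.normal(size=(500, 1))
X = np.hstack([base,
               2 * base + rng.normal(scale=0.1, size=(500, 1)),
               rng.normal(size=(500, 2))])

# Standardize, then keep the fewest components explaining 95% of variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, pca.explained_variance_ratio_)
```

The collinear pair collapses into a single component, so fewer dimensions survive than there are input features.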
house_df_base=house_df_new.copy()
attributes = house_df_base.drop('price',axis=1)
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters=range(1,20)
meanDistortions=[]
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(attributes)
    prediction = model.predict(attributes)
    meanDistortions.append(sum(np.min(cdist(attributes, model.cluster_centers_, 'euclidean'), axis=1)) / attributes.shape[0])
print(meanDistortions)
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
Using unsupervised learning technique - K means clustering, the number of clusters identified is 4
# K = 4
final_model=KMeans(n_clusters=4)
final_model.fit(attributes)
prediction=final_model.predict(attributes)
#Append the prediction
house_df_base["GROUP"] = prediction
print("Groups Assigned : \n")
house_df_base[["price", "GROUP"]].head(5)
house_df_base.groupby(by='GROUP',axis=0).count()
# input and target variable definition
X_base=house_df_new.drop('price',axis=1)
y_base=house_df_new['price']
print("Base X set size:", X_base.shape)
print("Base Y set size:",y_base.shape)
from sklearn.model_selection import train_test_split
X_train_base, X_test_base, y_train_base, y_test_base = train_test_split(X_base, y_base, test_size=0.30, random_state=10)
print("X_train set size:", X_train_base.shape)
print("Y_train set size:",y_train_base.shape)
print("X_train set size:", X_test_base.shape)
print("Y_train set size:",y_test_base.shape)
# Linear Regression Model
from sklearn.linear_model import LinearRegression
base_regression_model = LinearRegression()
base_regression_model.fit(X_train_base, y_train_base)
base_reg_train_acc=round((base_regression_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_reg_train_acc,'%')
base_reg_test_acc= round((base_regression_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_reg_test_acc,'%')
# create a panda summary dataframe of results
#data=['Base model','Linear regression',base_reg_train_acc,base_reg_test_acc]
#data=np.vstack(data)
#print(data)
column_head=['Strategy','Modelling method','Train model accuracy','Test model accuracy']
data = {'Strategy': ['Base model'], 'Modelling method': ['Linear regression'],'Train model accuracy':
[base_reg_train_acc],'Test model accuracy':[base_reg_test_acc]}
acc_df = pd.DataFrame(data,columns=column_head)
acc_df
# Base Model - Decision Tree Regression (with base dataframe)
# Decision Tree Regression model
from sklearn.tree import DecisionTreeRegressor
base_dt_model = DecisionTreeRegressor(max_depth=4,random_state=100)
# a depth of 4 was chosen since unsupervised learning identified 4 clusters
base_dt_model.fit(X_train_base,y_train_base)
base_dt_train_acc=round((base_dt_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_dt_train_acc,'%')
base_dt_test_acc= round((base_dt_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_dt_test_acc,'%')
# append to accuracy summary table
row_add=['Base model','Decision Tree Regression',base_dt_train_acc,base_dt_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
base_rtr_model=RandomForestRegressor(max_depth=4,random_state=100)
# a depth of 4 was chosen since unsupervised learning identified 4 clusters
base_rtr_model.fit(X_train_base,y_train_base)
base_rtr_train_acc=round((base_rtr_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_rtr_train_acc,'%')
base_rtr_test_acc= round((base_rtr_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_rtr_test_acc,'%')
# append to accuracy summary table
row_add=['Base model','Random Forest Regression',base_rtr_train_acc,base_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train_base, y_train_base)  # fit the model
    pred = model.predict(X_test_base)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test_base, pred))  # calculate RMSE
    rmse_val.append(error)  # store RMSE values
    print('RMSE value for k =', K, 'is:', error)
#plotting the rmse values against k values
curve = pd.DataFrame(rmse_val) #elbow curve
curve.plot()
The knee appears at a k value of 2
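The knee can also be read off programmatically as the k with the lowest RMSE; a tiny sketch using hypothetical RMSE values shaped like the curve above:

```python
import numpy as np

# Hypothetical RMSE values for k = 1..8, shaped like the elbow curve above
rmse_val = [0.52, 0.41, 0.43, 0.45, 0.47, 0.48, 0.50, 0.51]

best_k = int(np.argmin(rmse_val)) + 1  # k is 1-indexed in the sweep
print("best k by minimum RMSE:", best_k)  # prints 2
```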
base_knn_model = neighbors.KNeighborsRegressor(n_neighbors = 2)
base_knn_model.fit(X_train_base,y_train_base)
base_knn_train_acc=round((base_knn_model.score(X_train_base,y_train_base)*100),2)
print ('Train model accuracy:' ,base_knn_train_acc,'%')
base_knn_test_acc= round((base_knn_model.score(X_test_base,y_test_base)*100),2)
print ('Test model accuracy:', base_knn_test_acc,'%')
# append to accuracy summary table
row_add=['Base model','kNN Regression',base_knn_train_acc,base_knn_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
In base modelling, Random Forest Regression gives the best test accuracy of 74.56%. However, with a train accuracy of only around 74.6%, there is clear scope to improve the model through hyperparameter tuning.
kNN Regression yields the highest train accuracy but the lowest test accuracy, indicating that the impact of outliers is high. Hence, we proceed with Outlier Strategy 1.
# outliers to be removed - grouping and creating a dataframe
house_df_out1=house_df_new.copy()
house_df_out1.info()
# outlier strategy 1
# dropping rows containing the indices of outlier elements.
house_df_out1=house_df_out1.drop(total_area_outlier['cid'].index|lot_meas15_outlier['cid'].index|
liv_meas15_outlier['cid'].index|room_bed_outlier_1['cid'].index|
lot_measure_outlier['cid'].index|long_outlier['cid'].index|
coast_outlier['cid'].index|condition_outlier['cid'].index|
yr_renovated_outlier['cid'].index,axis=0)
# dropping 'cid' from the dataframe
house_df_out1=house_df_out1.drop('cid',axis=1)
house_df_out1.info()
# Scaling the dataframe with z-scores so all variables are on a comparable scale
from scipy.stats import zscore
house_scaled_df_out_1 = house_df_out1.apply(zscore)
house_scaled_df_out_1.isnull().sum()
house_scaled_df_out_1=house_scaled_df_out_1.fillna(0)
house_scaled_df_out_1.isnull().sum()
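The NaNs being filled here most likely come from columns that are constant within this frame: z-scoring divides by the standard deviation, which is zero for a constant column, so the result is undefined. A minimal illustration:

```python
import pandas as pd
from scipy.stats import zscore

# A constant column has zero standard deviation, so its z-score is 0/0 = NaN
df = pd.DataFrame({'varying': [1.0, 2.0, 3.0], 'constant': [1.0, 1.0, 1.0]})
scaled = df.apply(zscore)
print(scaled['constant'].isnull().all())  # True: the constant column is all NaN
scaled = scaled.fillna(0)                 # the notebook's remedy
```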
# re-wrap to ensure a dataframe with the original column labels
house_scaled_df_out_1 = pd.DataFrame(house_scaled_df_out_1, columns=house_df_out1.columns)
#Evaluating the scaled dataframe
house_scaled_df_out_1.shape
X_out_1=house_scaled_df_out_1.drop('price',axis=1)
y_out_1=house_scaled_df_out_1['price']
print("Out 1 X set size:", X_out_1.shape)
print("Out 1 Y set size:",y_out_1.shape)
X_train_out_1, X_test_out_1, y_train_out_1, y_test_out_1 = train_test_split(X_out_1, y_out_1, test_size=0.30, random_state=10)
print("Out 1 X_train set size:", X_train_out_1.shape)
print("Out 1 Y_train set size:",y_train_out_1.shape)
print("Out 1 X_train set size:", X_test_out_1.shape)
print("Out 1 Y_train set size:",y_test_out_1.shape)
house_df_out1_k=house_df_out1.copy()
attributes = house_df_out1_k.drop('price',axis=1)
#Finding optimal no. of clusters
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
clusters=range(1,20)
meanDistortions=[]
for k in clusters:
model=KMeans(n_clusters=k)
model.fit(attributes)
prediction=model.predict(attributes)
meanDistortions.append(sum(np.min(cdist(attributes, model.cluster_centers_, 'euclidean'), axis=1)) / attributes.shape[0])
print (meanDistortions)
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
With the Outlier Strategy 1 treated dataframe, k-means indicates there are 5 clusters.
out_1_regression_model = LinearRegression()
out_1_regression_model.fit(X_train_out_1, y_train_out_1)
out_1_reg_train_acc=round((out_1_regression_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_reg_train_acc,'%')
out_1_reg_test_acc= round((out_1_regression_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_reg_test_acc,'%')
With Outlier Strategy 1, Linear Regression yields a test accuracy of about 69%.
# append to accuracy summary table
row_add=['Out 1 model','Linear Regression',out_1_reg_train_acc,out_1_reg_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# Decision Tree Regression model
out_1_dt_model = DecisionTreeRegressor(max_depth=5,random_state=100)
# a depth of 5 was chosen since unsupervised learning identified 5 clusters
out_1_dt_model.fit(X_train_out_1,y_train_out_1)
out_1_dt_train_acc=round((out_1_dt_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_dt_train_acc,'%')
out_1_dt_test_acc= round((out_1_dt_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_dt_test_acc,'%')
# append to accuracy summary table
row_add=['Out 1 model','Decision Tree Regression',out_1_dt_train_acc,out_1_dt_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
out_1_rtr_model=RandomForestRegressor(max_depth=5,random_state=100)
# depth of 5 has been considered since the number of clusters is 5 identified through unsupervised learning
out_1_rtr_model.fit(X_train_out_1,y_train_out_1)
out_1_rtr_train_acc=round((out_1_rtr_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_rtr_train_acc,'%')
out_1_rtr_test_acc= round((out_1_rtr_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_rtr_test_acc,'%')
# append to accuracy summary table
row_add=['Out 1 model','Random Forest Regression',out_1_rtr_train_acc,out_1_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse_val = [] #to store rmse values for different k
for K in range(1, 21):
    model = neighbors.KNeighborsRegressor(n_neighbors=K)
    model.fit(X_train_out_1, y_train_out_1)  # fit the model
    pred = model.predict(X_test_out_1)  # make predictions on the test set
    error = sqrt(mean_squared_error(y_test_out_1, pred))  # calculate RMSE
    rmse_val.append(error)  # store RMSE values
    print('RMSE value for k =', K, 'is:', error)
#plotting the rmse values against k values
curve = pd.DataFrame(rmse_val) #elbow curve
curve.plot()
The knee appears at a k value of 3
out_1_knn_model = neighbors.KNeighborsRegressor(n_neighbors = 3)
out_1_knn_model.fit(X_train_out_1,y_train_out_1)
out_1_knn_train_acc=round((out_1_knn_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,out_1_knn_train_acc,'%')
out_1_knn_test_acc= round((out_1_knn_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', out_1_knn_test_acc,'%')
# append to accuracy summary table
row_add=['Out 1 model','kNN Regression',out_1_knn_train_acc,out_1_knn_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 1, stop = 100, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
pprint(random_grid)
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor(random_state = 42)
# Random search of parameters, using 3 fold cross validation,
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
n_iter = 100, scoring='neg_mean_absolute_error',
cv = 3, verbose=2, random_state=42, n_jobs=-1,
return_train_score=True)
# Fit the random search model
rf_random.fit(X_train_out_1, y_train_out_1);
rf_random.best_params_
randomcv_rtr_model=RandomForestRegressor(n_estimators=100,min_samples_split=2,min_samples_leaf=2,max_features='auto',
                                         max_depth=90,random_state=100,bootstrap=True)
randomcv_rtr_model.fit(X_train_out_1,y_train_out_1)
randomcv_rtr_train_acc=round((randomcv_rtr_model.score(X_train_out_1,y_train_out_1)*100),2)
print ('Train model accuracy:' ,randomcv_rtr_train_acc,'%')
randomcv_rtr_test_acc= round((randomcv_rtr_model.score(X_test_out_1,y_test_out_1)*100),2)
print ('Test model accuracy:', randomcv_rtr_test_acc,'%')
# append to accuracy summary table
row_add=['Out 1 model','Randomsearch CV Forest Regression',randomcv_rtr_train_acc,randomcv_rtr_test_acc]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
#kfold validation on Random Forest Regression - Outlier Strategy 1
from sklearn.model_selection import KFold
Train_scores_1 = []
Test_scores_1 = []
cv = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X_out_1):  # split the full Out 1 set, since iloc below indexes X_out_1
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index, "\n")
    X_train_CV_out_1, X_test_CV_out_1, y_train_CV_out_1, y_test_CV_out_1 = X_out_1.iloc[train_index], X_out_1.iloc[test_index], y_out_1.iloc[train_index], y_out_1.iloc[test_index]
    kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                            max_features='auto', max_depth=90, random_state=100, bootstrap=True)
    kfold_rtr_model.fit(X_train_CV_out_1, y_train_CV_out_1)
    kfold_rtr_train_acc = round((kfold_rtr_model.score(X_train_CV_out_1, y_train_CV_out_1)*100), 2)
    print('Train model accuracy:', kfold_rtr_train_acc, '%', "\n")
    kfold_rtr_test_acc = round((kfold_rtr_model.score(X_test_CV_out_1, y_test_CV_out_1)*100), 2)
    print('Test model accuracy:', kfold_rtr_test_acc, '%', "\n")
    Train_scores_1.append(kfold_rtr_train_acc)
    Test_scores_1.append(kfold_rtr_test_acc)
Train_mean_1=np.mean(Train_scores_1)
Test_mean_1=np.mean(Test_scores_1)
Train_sc_pl_1 = pd.DataFrame(Train_scores_1)
print("Train scores:",Train_scores_1,"\n")
Test_sc_pl_1 = pd.DataFrame(Test_scores_1)
print("Test scores:",Test_scores_1,"\n")
print("Average Train score:",Train_mean_1,"\n")
print("Average Test score:",Test_mean_1,"\n")
plt.plot(Train_sc_pl_1)
plt.plot(Test_sc_pl_1)
plt.show()
# append to accuracy summary table
row_add=['Out 1 model','kfold Random Forest Regression (Mean)',round(Train_mean_1,2),round(Test_mean_1,2)]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor
Train_scores_gb = []
Test_scores_gb = []
cv = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X_out_1):  # split the full Out 1 set, since iloc below indexes X_out_1
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index, "\n")
    X_train_CV_out_1, X_test_CV_out_1, y_train_CV_out_1, y_test_CV_out_1 = X_out_1.iloc[train_index], X_out_1.iloc[test_index], y_out_1.iloc[train_index], y_out_1.iloc[test_index]
    gb_model = GradientBoostingRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                         max_features='auto', max_depth=90, random_state=100)
    gb_model.fit(X_train_CV_out_1, y_train_CV_out_1)
    gb_model_train_acc = round((gb_model.score(X_train_CV_out_1, y_train_CV_out_1)*100), 2)
    print('Train model accuracy:', gb_model_train_acc, '%', "\n")
    gb_model_test_acc = round((gb_model.score(X_test_CV_out_1, y_test_CV_out_1)*100), 2)
    print('Test model accuracy:', gb_model_test_acc, '%', "\n")
    Train_scores_gb.append(gb_model_train_acc)
    Test_scores_gb.append(gb_model_test_acc)
Train_mean_gb=np.mean(Train_scores_gb)
Test_mean_gb=np.mean(Test_scores_gb)
Train_sc_pl_gb = pd.DataFrame(Train_scores_gb)
print("Train scores:",Train_scores_gb,"\n")
Test_sc_pl_gb = pd.DataFrame(Test_scores_gb)
print("Test scores:",Test_scores_gb,"\n")
print("Average Train score:",Train_mean_gb,"\n")
print("Average Test score:",Test_mean_gb,"\n")
# append to accuracy summary table
row_add=['Out 1 model','kfold Gradient Boosting Regression(mean)',round(Train_mean_gb,2),round(Test_mean_gb,2)]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# Outlier Strategy 2
# creating new dataframe with suspected outliers to be removed based on model accuracy
house_df_out2=house_df_new.copy()
house_df_out2=house_df_out2.drop(total_area_outlier['cid'].index|lot_meas15_outlier['cid'].index|
                                 liv_meas15_outlier['cid'].index|room_bed_outlier_1['cid'].index|
                                 lot_measure_outlier['cid'].index|long_outlier['cid'].index|
                                 coast_outlier['cid'].index|condition_outlier['cid'].index|
                                 yr_renovated_outlier['cid'].index|room_bath_outlier['cid'].index|
                                 room_bed_outlier['cid'].index|living_measure_outlier['cid'].index|
                                 sight_outlier['cid'].index|quality_outlier['cid'].index,axis=0)
# dropping 'cid' from the dataframe
house_df_out2=house_df_out2.drop('cid',axis=1)
house_df_out2.info()
# Scaling the dataframe with z-scores so all variables are on a comparable scale
house_scaled_df_out_2 = house_df_out2.apply(zscore)
house_scaled_df_out_2.info()
house_scaled_df_out_2=house_scaled_df_out_2.fillna(0)
# re-wrap to ensure a dataframe with the original column labels
house_scaled_df_out_2 = pd.DataFrame(house_scaled_df_out_2, columns=house_df_out2.columns)
#Evaluating the scaled dataframe
house_scaled_df_out_2.shape
X_out_2=house_scaled_df_out_2.drop('price',axis=1)
y_out_2=house_scaled_df_out_2['price']
print("Out 2 X set size:", X_out_2.shape)
print("Out 2 Y set size:",y_out_2.shape)
X_train_out_2, X_test_out_2, y_train_out_2, y_test_out_2 = train_test_split(X_out_2, y_out_2, test_size=0.30, random_state=10)
print("Out 2 X_train set size:", X_train_out_2.shape)
print("Out 2 Y_train set size:",y_train_out_2.shape)
print("Out 2 X_train set size:", X_test_out_2.shape)
print("Out 2 Y_train set size:",y_test_out_2.shape)
# kfold Random Forest Regression with Outlier Strategy 2
Train_scores_2 = []
Test_scores_2 = []
cv = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X_out_2):  # split the full Out 2 set, since iloc below indexes X_out_2
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index, "\n")
    X_train_CV_out_2, X_test_CV_out_2, y_train_CV_out_2, y_test_CV_out_2 = X_out_2.iloc[train_index], X_out_2.iloc[test_index], y_out_2.iloc[train_index], y_out_2.iloc[test_index]
    kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                            max_features='auto', max_depth=90, random_state=100, bootstrap=True)
    kfold_rtr_model.fit(X_train_CV_out_2, y_train_CV_out_2)
    kfold_rtr_train_acc = round((kfold_rtr_model.score(X_train_CV_out_2, y_train_CV_out_2)*100), 2)
    print('Train model accuracy:', kfold_rtr_train_acc, '%', "\n")
    kfold_rtr_test_acc = round((kfold_rtr_model.score(X_test_CV_out_2, y_test_CV_out_2)*100), 2)
    print('Test model accuracy:', kfold_rtr_test_acc, '%', "\n")
    Train_scores_2.append(kfold_rtr_train_acc)
    Test_scores_2.append(kfold_rtr_test_acc)
Train_mean_2=np.mean(Train_scores_2)
Test_mean_2=np.mean(Test_scores_2)
Train_sc_pl_2 = pd.DataFrame(Train_scores_2)
print("Train scores:",Train_scores_2,"\n")
Test_sc_pl_2 = pd.DataFrame(Test_scores_2)
print("Test scores:",Test_scores_2,"\n")
print("Average Train score:",Train_mean_2,"\n")
print("Average Test score:",Test_mean_2,"\n")
plt.plot(Train_sc_pl_2)
plt.plot(Test_sc_pl_2)
plt.show()
# append to accuracy summary table
row_add=['Out 2 model','kfold Random Forest Regression (Mean)',round(Train_mean_2,2),round(Test_mean_2,2)]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
# kfold Random Forest Regression with Base Dataframe
Train_scores_base = []
Test_scores_base = []
cv = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X_base):  # split the full base set, since iloc below indexes X_base
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index, "\n")
    X_train_CV_base, X_test_CV_base, y_train_CV_base, y_test_CV_base = X_base.iloc[train_index], X_base.iloc[test_index], y_base.iloc[train_index], y_base.iloc[test_index]
    kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                            max_features='auto', max_depth=90, random_state=100, bootstrap=True)
    kfold_rtr_model.fit(X_train_CV_base, y_train_CV_base)
    kfold_rtr_train_acc = round((kfold_rtr_model.score(X_train_CV_base, y_train_CV_base)*100), 2)
    print('Train model accuracy:', kfold_rtr_train_acc, '%', "\n")
    kfold_rtr_test_acc = round((kfold_rtr_model.score(X_test_CV_base, y_test_CV_base)*100), 2)
    print('Test model accuracy:', kfold_rtr_test_acc, '%', "\n")
    Train_scores_base.append(kfold_rtr_train_acc)
    Test_scores_base.append(kfold_rtr_test_acc)
Train_mean_base=np.mean(Train_scores_base)
Test_mean_base=np.mean(Test_scores_base)
Train_sc_pl_base = pd.DataFrame(Train_scores_base)
print("Train scores:",Train_scores_base,"\n")
Test_sc_pl_base = pd.DataFrame(Test_scores_base)
print("Test scores:",Test_scores_base,"\n")
print("Average Train score:",Train_mean_base,"\n")
print("Average Test score:",Test_mean_base,"\n")
plt.plot(Train_sc_pl_base)
plt.plot(Test_sc_pl_base)
plt.show()
# append to accuracy summary table
row_add=['Base model','kfold Random Forest Regression (Mean)',round(Train_mean_base,2),round(Test_mean_base,2)]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                        max_features='auto', max_depth=90, random_state=100, bootstrap=True)
kfold_rtr_model.fit(X_train_CV_out_1, y_train_CV_out_1)
feature_importances = pd.DataFrame(kfold_rtr_model.feature_importances_,
                                   index=X_train_CV_out_1.columns,
                                   columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head(15)
# retaining only the top 15 features in the dataframe
house_df_fimp=house_df_out1.copy()
house_df_fimp.shape
house_df_fimp=house_df_fimp[['lat','living_measure','furnished_0','quality','furnished_1','long','living_measure15',
'yr_built','ceil_measure','zipcode','lot_measure15','sold_date_full','sight','total_area','sold_date','price']]
house_df_fimp.shape
# creating input and output variables
X_imp=house_df_fimp.drop('price',axis=1)
y_imp=house_df_fimp['price']
print("Feature improved X set size:", X_imp.shape)
print("Feature improved Y set size:",y_imp.shape)
X_train_imp, X_test_imp, y_train_imp, y_test_imp = train_test_split(X_imp, y_imp, test_size=0.30, random_state=10)
print("X_train set size:", X_train_imp.shape)
print("Y_train set size:",y_train_imp.shape)
print("X_train set size:", X_test_imp.shape)
print("Y_train set size:",y_test_imp.shape)
# kfold Random Forest Regression with feature importance applied to Outlier Strategy 1
Train_scores_1_imp = []
Test_scores_1_imp = []
cv = KFold(n_splits=10, random_state=42, shuffle=True)
for train_index, test_index in cv.split(X_imp):  # split the full feature-importance set, since iloc below indexes X_imp
    print("Train Index: ", train_index, "\n")
    print("Test Index: ", test_index, "\n")
    X_train_CV_out_1_imp, X_test_CV_out_1_imp, y_train_CV_out_1_imp, y_test_CV_out_1_imp = X_imp.iloc[train_index], X_imp.iloc[test_index], y_imp.iloc[train_index], y_imp.iloc[test_index]
    kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                            max_features='auto', max_depth=90, random_state=100, bootstrap=True)
    kfold_rtr_model.fit(X_train_CV_out_1_imp, y_train_CV_out_1_imp)
    kfold_rtr_imp_train_acc = round((kfold_rtr_model.score(X_train_CV_out_1_imp, y_train_CV_out_1_imp)*100), 2)
    print('Train model accuracy:', kfold_rtr_imp_train_acc, '%', "\n")
    kfold_rtr_imp_test_acc = round((kfold_rtr_model.score(X_test_CV_out_1_imp, y_test_CV_out_1_imp)*100), 2)
    print('Test model accuracy:', kfold_rtr_imp_test_acc, '%', "\n")
    Train_scores_1_imp.append(kfold_rtr_imp_train_acc)
    Test_scores_1_imp.append(kfold_rtr_imp_test_acc)
Train_mean_1_imp=np.mean(Train_scores_1_imp)
Test_mean_1_imp=np.mean(Test_scores_1_imp)
Train_sc_pl_1_imp = pd.DataFrame(Train_scores_1_imp)
print("Train scores:",Train_scores_1_imp,"\n")
Test_sc_pl_1_imp = pd.DataFrame(Test_scores_1_imp)
print("Test scores:",Test_scores_1_imp,"\n")
print("Average Train score:",Train_mean_1_imp,"\n")
print("Average Test score:",Test_mean_1_imp,"\n")
plt.plot(Train_sc_pl_1_imp)
plt.plot(Test_sc_pl_1_imp)
plt.show()
# append to accuracy summary table
row_add=['Out 1 model','feature imp kfold Random Forest Regression (Mean)',round(Train_mean_1_imp,2),round(Test_mean_1_imp,2)]
np_array=acc_df.values
np_array=np.vstack((np_array,row_add))
acc_df=pd.DataFrame(np_array,columns=column_head)
acc_df
acc_df['Train model accuracy'],acc_df['Test model accuracy']=acc_df['Train model accuracy'].astype('float64'),acc_df['Test model accuracy'].astype('float64')
acc_df.pivot(index="Strategy",columns="Modelling method",values="Test model accuracy").plot(kind='bar',title="Test model accuracy",ylim=[70,100],yticks=[70,75,80,85,90,95,100],grid=True,figsize=(20,20))
As seen in the summary graph above, the best benchmark in base modelling is the k-fold Random Forest Regression model (10 folds), with a mean test accuracy of 87.14%.
With Outlier Strategy 1, RandomizedSearchCV tuning [3 folds for each of 100 candidates, totalling 300 fits] produced the best overall accuracy of 88.13%.
However, to eliminate bias, k-fold Random Forest Regression with 10 folds was performed on Outlier Strategy 1, resulting in a mean test accuracy of 87.5%.
Feature importance was then applied to this methodology to reduce the overall dimensions to 15. No improvement in modelling accuracy was observed; yet, considering the reduced computational effort required, this method is adopted for the production model.
With EDA on the subject dataframe, key insights were drawn about the target variable 'price'. Two outlier strategies were formulated to improve modelling accuracy, with the absolute target accuracy set at 85 to 90%.
The base dataframe was then evaluated against different regression and ensemble algorithms to set the modelling benchmark.
Each outlier strategy was evaluated across the various algorithms, and the best-performing algorithm was subjected to hyperparameter tuning and k-fold cross-validation to improve the modelling accuracy. Finally, feature importance was applied to the best-performing model.
import pickle
# Dump the feature engineered dataframe structure with Pickle
def feat_extract(dataframe):
    # feature engineering steps
    house_df_new = dataframe.copy()
    house_df_new['sold_date_full'] = house_df_new['dayhours'].str[:8].astype('int64')
    house_df_new['sold_date'] = house_df_new['dayhours'].str[6:8].astype('int64')
    house_df_new = pd.get_dummies(house_df_new, columns=['furnished'])
    house_df_fimp = house_df_new[['lat','living_measure','furnished_0','quality','furnished_1','long','living_measure15',
                                  'yr_built','ceil_measure','zipcode','lot_measure15','sold_date_full','sight','total_area','sold_date','price']]
    # dump the feature-engineered dataframe with pickle
    pickle.dump(house_df_fimp, open("feature_engg.pkl", "wb"))
    return house_df_fimp
feat_extract(house_df)
def evaluate_model(dataframe):
    # input and target variable assignment
    X = dataframe.drop('price', axis=1)
    y = dataframe['price']
    # fit the production model
    from sklearn.ensemble import RandomForestRegressor
    kfold_rtr_model = RandomForestRegressor(n_estimators=100, min_samples_split=2, min_samples_leaf=2,
                                            max_features='auto', max_depth=90, random_state=100, bootstrap=True)
    kfold_rtr_model.fit(X, y)
    accuracy = round((kfold_rtr_model.score(X, y)*100), 2)
    # productionize the model: persist it with pickle
    pickle.dump(kfold_rtr_model, open('model.pkl', 'wb'))
    return accuracy
evaluate_model(house_df_fimp)
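Once model.pkl is dumped, a production service only needs pickle.load to serve predictions. A minimal round-trip sketch, using a small stand-in model trained on synthetic data (the real file is produced by evaluate_model above):

```python
import pickle
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-in for the notebook's trained model: a tiny forest on synthetic data
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
pickle.dump(model, open('model.pkl', 'wb'))

# Load the pickled model and predict, as a deployed service would
loaded = pickle.load(open('model.pkl', 'rb'))
preds = loaded.predict(X[:3])
print(preds.shape)  # (3,)
```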